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Abstract 

While much work has explored probabilistic 
graphical models for independent data, less at¬ 
tention has been paid to time series. The goal 
in this setting is to determine conditional inde¬ 
pendence relations between entire time series, 
which for stationary series, are encoded by zeros 
in the inverse spectral density matrix. We take a 
Bayesian approach to structure learning, placing 
priors on (i) the graph structure and (ii) spectral 
matrices given the graph. We leverage a Whittle 
likelihood approximation and dehne a conjugate 
prior—the hyper complex inverse Wishart —on 
the complex-valued and graph-constrained spec¬ 
tral matrices. Due to conjugacy, we can ana¬ 
lytically marginalize the spectral matrices and 
obtain a closed-form marginal likelihood of the 
time series given a graph. Importantly, our an¬ 
alytic marginal likelihood allows us to avoid in¬ 
ference of the complex spectral matrices them¬ 
selves and places us back into the framework of 
standard (Bayesian) structure learning. In partic¬ 
ular, combining this marginal likelihood with our 
graph prior leads to efficient inference of the time 
series graph itself, which we base on a stochastic 
search procedure, though any standard approach 
can be straightforwardly modified to our time se¬ 
ries case. We demonstrate our methods on ana¬ 
lyzing stock data and neuroimaging data of brain 
activity during various auditory tasks. 

1 INTRODUCTION 

Probabilistic graphical models (PGMs)—which compactly 
encode a set of conditional independence statements—have 
become a defacto tool for defining probabilistic models 
over large sets of random variables. When faced with time 
series, dynamic Bayesian networks (DBNs) are commonly 
deployed and specify sparse between- and within-time de¬ 


pendencies, often encoded by a template model replicated 
across time to straightforwardly model the growing set of 
random variables m. Learning template models requires 
specifying the set of dependency lags to be considered 
El a. In many applications, one instead aims to infer 
conditional independence between entire data streams ac¬ 
counting for interactions at all possible lags, represented by 
a time series graphical model (TGM). For example, imag¬ 
ine recording brain activity from multiple regions of the 
brain over time. Inference of a TGM in this setting would 
provide insight into the functional connectivity of differ¬ 
ent brain regions, an object of substantial scientific interest 
ilia. TGMs have also been applied to intensive care mon¬ 
itoring la and hnancial time series Q. 

The pioneering work of Dahlhaus ISl introduced the con¬ 
cept of undirected graphical models for stationary time se¬ 
ries. The key insight was to transform the series to fre¬ 
quency domain and express the graph relationships in the 
resulting spectral representation. For jointly Gaussian sta¬ 
tionary time series, Dahlhaus showed that conditional 
independencies between time series are encoded by zeros 
in the inverse spectral density matrices. This result is the 
frequency-domain analog to Gaussian graphical modeling 
in the i.i.d. (non-time-series) setting, where zeros in the 
inverse covariance matrix, or precision matrix, encode the 
conditional independencies between observed dimensions 
El- Dahlhaus’ insight was hrst exploited to perform inde¬ 
pendent hypothesis tests of conditional independence be¬ 
tween each pair of time series il, with more recent work 
correcting for multiple comparisons nniini. 

A likelihood-based approach leveraging the Whittle ap¬ 
proximation ca has also been introduced Cl- The Whit¬ 
tle approximation casts the likelihood in the frequency do¬ 
main with terms depending on the spectral density matri¬ 
ces critical to TGM structure learning, and independently 
so across frequencies. One approach scores graphs using 
AIC lfT3]l . A recent penalized likelihood variant lfT4ll places 
a joint graphical lasso na across frequencies to enforce a 
common zero pattern in the spectral density matrices. A 
penalized likelihood approach restricted to hnite vector au- 




toregressive processes has also been considered ||2l- 

We instead consider a Bayesian approach to TGM structure 
learning, with all the benefits garnered from the Bayesian 
paradigm, including modeling within a generative frame¬ 
work where information from multiple sources can inte¬ 
grated and combined with available prior knowledge. For 
example, neural data are notoriously noisy, and robust in¬ 
ferences often rely on integrating time series across mul¬ 
tiple trials and individuals or recording platforms (e.g., 
EEG/MEG). Our approach also leverages the Whittle like¬ 
lihood. We then introduce a novel hyper Markov law M, 
the hyper complex inverse Wishart distribution, that serves 
as a conjugate prior for the spectral density matrices whose 
inverses have a zero pattern specified by a graph. Eor 
decomposable graphs, this formulation leads to a closed- 
form expression for the marginal likelihood of a multivari¬ 
ate time series given a graph. By placing a prior on graph 
structures, we achieve a fully Bayesian approach to TGM 
structure learning for stationary time series. Eor our graph 
prior, we consider a multiplicity correcting prior QT). Our 
analytic expression for the marginal likelihood is critical to 
the practicality of our approach since we can avoid infer¬ 
ence of the large set of high-dimensional, complex spec¬ 
tral density matrices. In particular, for a length T series of 
dimension p, there Sire T p x p spectral matrices to con¬ 
sider. In the i.i.d. setting, inference of just a single p x p 
graph-constrained covariance matrix is challenging; in this 
setting, inference of the T p x p matrices is prohibitive. 

Hyper Markov laws based on the hyper inverse Wishart are 
a popular tool for Bayesian graphical model selection in 
the i.i.d. setting lfT8l[T9ll . Indeed, many powerful Bayesian 
structure learning algorithms based on this framework have 
been developed, both for decomposable EOl l2Tll and non- 
decomposable Il22ll2^ graphs. By framing TGM structure 
learning in this common framework, we are able to apply 
existing state-of-the-art inference machinery for standard 
structure learning to the time series case. In this paper we 
use the feature-inclusion stochastic search (EINCS) proce¬ 
dure EOl for inference in decomposable models; however, 
many other MCMC and search schemes may be used. Im¬ 
portantly, future computational advances in Bayesian in¬ 
ference for i.i.d. graphical models may be easily extended 
using our framework to the time series case. 

We test our methods on data simulated from vector autore¬ 
gressive models with randomly generated TGMs. Our ap¬ 
proach reaches almost perfect TGM recovery as the length 
of the time series or number of independent replicates in¬ 
creases. We then demonstrate the utility of our methods on 
a global stock indices dataset and MEG neuroimaging data 
of auditory attention switching tasks. In both cases we find 
meaningful, intuitive structure in the data. 

Our paper is organized as follows. We provide background 
on graphical models and stationary time series in Sec. 


Our proposed TGM method is in Sec.|^ first introduced in 
the context of multiple independent realizations and then 
adapted to perform efficient inference of the TGM from 
only a single realization. In Sec.[^ we discuss how existing 
Bayesian structure learning methods may be modified to fit 
our formulation. Simulated results are in Sec. with our 
stock and MEG analyses in Secs.[7]an di respectively. 


2 BACKGROUND 

2.1 Graphs 

Let G = (y, E) be an undirected graph with vertex set 

V = {1,... ,p} and edge set E, where E C £ 

V X V : i ^ j}. Nodes i and j are adjacent, or neighbors, 
if {i, j) G E. A complete graph is one having (f, j) G E for 
every i,j G V and complete subgraphs C C E are termed 
cliques. A triple of subgraphs {A, S, B) where V = AU B 
and S = A D B with S complete is called a decomposi¬ 
tion if every path from a node in A to a node in B must 
pass through S, the separator. Recursively decomposing 
A and B in this fashion results in the prime components 
of a graph. If the prime components are complete then the 
graph is decomposable. We let the sets C = {Ci,..., Ck} 
and S = {S 2 ,.. ■, Sk} each denote the prime components 
and their separators, respectively, generated by the decom¬ 
position. Eor simplicity, we restrict our attention to decom¬ 
posable graphs but stress that our formulation is extensible 
to the non-decomposable case (see Sec.[^. 


2.2 Hyper Markov distributions 


Eor a given set of of random variables X, with realization 
X G X, dimensionality p, and joint density p{x), an undi¬ 
rected graphical model G can be constructed by stating that 
an edge {i,j) ^ E\f Xi and Xj are conditionally indepen¬ 
dent given the remaining variables, i.e. Xj _LL 
where Zij = V \ {i,j}. If the graph is decomposable, the 
joint density decomposes over cliques and separators: 


p(x) 


ncgcp(^c) 


( 1 ) 


where p{xa) for A <gV denotes the marginal distribution 
of the set of variables xa- 

A hyper Markov law M is a distribution over probabil¬ 
ity measures that is concentrated on distributions that obey 
the Markov properties specified by G. Examples include 
the hyper Wishart and hyper Dirichlet distribution ifT^fTSl . 
Such distributions have proven pivotal in Bayesian graph¬ 
ical modeling by serving as conjugate priors for the graph 
parameters conditioned on the graph structure G. Eor ex¬ 
ample, in Gaussian graphical models (GGMs), the hyper 
inverse Wishart distribution provides a conjugate prior for 
covariance matrices that obey a zero pattern in the pre¬ 
cision, as specified by G. By integrating over the hyper 



Markov distribution, one can obtain the marginal likeli¬ 
hood of the data conditioned on the structure G alone. 

2.3 Stationary time series 

Let X{t) = (Xi(t),S for t S Z be a 
multivariate Gaussian stationary time series such that: 


marginal likelihood of the time series given the graph struc¬ 
ture, G, allowing us to straightforwardly leverage state-of- 
the-art computational methods for i.i.d. Bayesian structure 
learning. 

3.1 Whittle likelihood 


E{X{t)) (2) 

Cov{x{t),x{t-^h)) = r{h) Vf,/iez. (3) 

A time series probabilistic graphical model (TGM), G = 
{V,E), may be constructed by letting {i,j) 4- ^ denote 
that the entire time series Xi{-.) and Xj{-.) are conditionally 
independent given the remaining collection of time series 
Xzi where Zij = 1^ \ {*, j}- For the Gaussian stationary 
series we consider, one can show that conditional indepen¬ 
dence holds between time series iff H 


Coy{X,{t),Xj{t + h)\Xz,,) =0 V/i e Z. (4) 

The spectral density matrix of a stationary time series is 
defined as the Fourier transform of the lagged covariance 
matrices, r(/i) = Cov{X(t), X{th)): 

OO 

5'(A) = r(h)e-^^^ (5) 

h— — c>C) 

for A G [0, 27r] and S{\) G and Hermitian positive 
definite. The marginal dependencies between time series 
are captured by S{X), and from Eq. (|^, S{X)ij = 0 for 
all A G [0,27r] iff T{h)ij = 0 for all h G Z. Further¬ 
more, conditional independence between Gaussian station¬ 
ary time series holds iff 

.S(A)-.i = 0 VAg[0,2^], ( 6 ) 

implying that inferring zeros in the inverse spectral den¬ 
sity matrices across frequencies equates with inferring the 
TGM structure ||8l. More background on the spectral ap¬ 
proach to time series is presented in the Supplement. 

3 A BAYESIAN APPROACH 

There are two standard approaches to Bayesian inference 
in graphical models: ( 1 ) placing a prior that jointly spec¬ 
ifies the graph structure and associated parameters or ( 2 ) 
placing a prior on graph structures and then a prior on pa¬ 
rameters given a graph; both rely on specifying a likelihood 
model. We opt for the second approach and describe the 
various components in this section. At a high level, our 
methods combine existing Whittle likelihood based meth¬ 
ods iniiii with the hyper Markov framework to Bayesian 
graphical modeling iniiisi. In the context of our TGMs, 
we introduce a conjugate hyper complex inverse Wishart 
prior on graph-constrained spectral density matrices. By 
integrating out the spectral density matrices, we obtain a 


Let X = [X(l),..., Ar(T)], with x{t) G MP a realiza¬ 
tion of a p-dimensional stationary Gaussian time series ob¬ 
served at T time points, and Xi:jv = • 7 X^} be 

the collection of N independent realizations. We move to 
the frequency domain by transforming each X® using a dis¬ 
crete Fourier transform. Let dnk G denote the discrete 
Fourier coefficient associated with the nth time series at 
frequency Xk = ^: 

1 

dnk = Xn{t)e (7) 

t^o 

The Whittle approximation ifT^ assumes the Fourier coef¬ 
ficients are independent complex normal random variables 
with mean zero and covariance given by the corresponding 
spectral density matrix Sk = S{Xk)' 


dnk ^ XfciO, Sk) k = 0,...,T-l, 
such that the likelihood of Xi-jv is approximated as 


N T-1 


P(Xl:Ar|S'o:T-l) ~ n n 


n—1 k—0 


1 

7rP|5'fc|* 


( 8 ) 

(9) 


where e ^ is the density of a complex normal 
distribution, A/’c(0, S), with S G and Hermitian posi¬ 
tive definite. See the Supplement. The Whittle approxima¬ 
tion holds asymptotically with large T ||24l |25] [El. This 
approximation has been used in the Bayesian context in 

ESI Ell 


Recall that conditional independencies are encoded in the 
off diagonal elements of If time series Xi{t) and 

Xj (t) are conditionally independent, then the Whittle ap¬ 
proximation says that as T gets large the ith and jth ele¬ 
ments of the Fourier coefficients dnk are conditional inde¬ 
pendent across all frequencies. Thus, if G is decomposable, 
Eq. (j^ can be rewritten as 


P(Xi:^|G,50:(T-1)) 


( 10 ) 


CgC 


^-aPkcSG 


n. _ 

A7 = 0 11SG5 7I-JV|S||S^g|N' 


a—iiPksSf. 


where 


N 

Pk = Ydnkd*nk ( 11 ) 

n—1 


is the aggregate periodogram over the N time series at fre¬ 
quency Eor A G V, SkA and PkA are the restriction 
of both matrices to the elements in A and |Al| denotes the 
cardinality of the set A. 







3.2 Hyper complex inverse Wishart prior on 
graph-constrained spectral density matrices 


We seek a prior for the spectral density matrices, Sk, whose 
inverses each have zeros dictated by a graph G. Recall that 
these Sk matrices are complex-valued and restricted to be 
Hermitian positive definite. As discussed in Sec. 2.2 the 
hyper inverse Wishart distribution serves as a prior for real¬ 
valued, positive-definite matrices with pre-specified zeros 
in the inverse, and is a conjugate prior for the covariance 
of a zero-mean GGM. Motivated by the connection be¬ 
tween GGMs and our TGMs, and the analogous structure 
of our TGM-based Whittle likelihood of Eq. ( [T0| ) to that 
of a GGM with N i.i.d. observations, we propose a novel 
hyper complex inverse Wishart prior with density function 


p{Ti\5,W,G) =(x 1 i;gm+(G)|S 


-{s+2p) 


( 12 ) 


for degrees of freedom (5 > 0, scale matrix W S 
positive definite and Hermitian, and graph G. We have 
used an analogous parameterization to that of the hyper in¬ 
verse Wishart ifThll . Here, S G M+jG) denotes that E is 
in the set of all Hermitian positive-definite matrices with 
(E“^)^^. = 0 for all {i,j) ^ E. When G is decomposable, 
the normalization constant is available and the density de¬ 
composes over cliques and separators; 


p(S|(5,fE,G) 


Uc^c'^^c{Ec\S,Wc) 

rises IW.(Ec|5,H^c) 

rises ’ 

(14) 


where IWc denotes the complex inverse Wishart ll25l de¬ 
tailed in the Supplement with normalizer 


B{W,S) 


\w\^+p 




(15) 


We denote our proposed prior as HIWc{6, W, G) and spec¬ 
ify 


Sk\G^HIW,{Sk,Wk,G) k = 0,...,T-l. (16) 


In the Supplement, we show that this prior specification is 
conjugate to the TGM-based Whittle likelihood of Eq. ( [TOl i. 
Also note that the graph, G, is shared across all frequencies. 


Here, 5* = 6k + N, W* = Wk + Pk, and 


h{W, 6, G) 


Y{c^cB{Wc,6) 

ns6sS(^s7^)' 


( 18 ) 


Erom the definition of 5j., we see that N, the number of 
time series, acts as the effective number of observations in 
this case. Eor the i.i.d. GGM, N represents the number 
of independent vector-valued observations; in our TGM, 
N plays the same role, but represents the number of inde¬ 
pendent time series observations. Likewise, as in standard 
inverse Wishart based modeling of covariances for i.i.d. 
Gaussian data, based on a set of N i.i.d. complex normal 
observations of Eourier coefficients dnk with covariance Sk 
(see Eq. (|^), we update the prior scale matrix Wk with the 
outer product Pk = Yln=i ^rikd*^k- which is the aggregate 
periodogram (see Eq. O)- 

Having an analytic marginal likelihood of the time series 
given a PGM allows us to perform inference directly over 
graphs, sidestepping any thorny issues with inference di¬ 
rectly on the T p X p spectral density matrices themselves. 
This is a critical feature of the practicality of our approach. 


3.4 Fractional priors for model selection 

Marginal likelihoods used for model comparison 12^ are 
notoriously sensitive to the choice of prior parameters, or 
hyperparameters. In our case, the marginal likelihood in 
Eq. ( [I4| ) depends strongly on the hyper complex inverse 
Wishart scale matrix, Wk- Since the scale and shape of the 
spectral density matrices are not known a priori, and vary 
dramatically across frequencies, we employ/racfional pri¬ 
ors ll29l over each Sk- Eractional priors effectively hold out 
some fraction of the data, and utilize that fraction to deter¬ 
mine an adequate hyperparameter setting for each model. 
The rest of the data are then used for model comparison. 
Eractional priors have been deployed for graphical model 
selection in i.i.d. graphs and have a number of desirable 
properties such as information consistency and demon¬ 
strated robustness j^ . In our case, under a fractional prior 
with parameter g G (0,1), the fractional marginal likeli¬ 
hood is 


p(Xi:^|5,G) 


.-NTp JT h{gPk,gN,G) 
h{Pk,N,G) 


(19) 


3.3 Marginal likelihood 


Due to conjugacy of our proposed hyper complex inverse 
Wishart prior, the marginal likelihood of the time series 
Xi:Ar given a decomposable graph G, integrating out the 
spectral density matrices So-,t-i, has a closed form which 
is derived in the Supplement and given by 


p(Xi:Jv|G) 


T-1 

-NTp 

fc=0 


h{Wk,6k,G) 

h{w:,si,Gy 


(17) 


Here, we see that g controls the fraction of data used for 
prior formulation versus model comparison. Importantly, 
we now have just a single, scalar, and interpretable param¬ 
eter g to tune. Default settings are suggested in ll29ll^ . 

3.5 Graph prior 

There are two common approaches in the literature to spec¬ 
ifying a prior distribution on graphs. The first approach 
places a uniform distribution on the space of all possible 












graphs ifTSlI^ISTI . As noted in Il32l . this prior puts high 
weight on graphs with a medium number of edges and sig¬ 
nificantly less weight on graphs with small or many edges. 
In response to this problem, it has been proposed to place 
a prior directly on the size of the graph and then consider a 
conditionally uniform prior on all graphs of the same size 
EH Esin. We follow this later approach and place a bi¬ 
nomial distribution on the number of edges, k: 

p{G) cx (20) 

where r is the prior probability that each of m = 
possible undirected edges {i,j) G VxV is included. Since 
r is unknown, we further place a Beta(a, b) prior over r. 
Integrating out r gives the marginal prior over graphs 


1) it does not enforce sparsity in the inverse spectral den¬ 
sity and 2) a prior of this form will remove the simple 
marginal likelihood structure in Eq. Cl that we harness 
for efficient inference. Motivated by our aims to both share 
information across frequencies and maintain the form of 
the marginal, we utilize a piecewise constant prior over 
spectral densities given a graph, G. We partition the in¬ 
terval [0, 27r] into M intervals wi = [O, . ,Wj = 


27r0-l) 27Tj \ 
M ^ M ) 


,WM = 


27 r(M-l) 
M ■ 


2tv 


and then draw 


a separate positive definite Hermitian matrix from a HIWc 
distribution for each interval; 


Sj ~ HIWM W,,G) j = 1,..., M. (22) 


Our resulting spectral density is simply 


p{G) oc 


/3(a k^b m 
P[a,b) 


k) 


( 21 ) 


M 

S{X) = ^ 1 S, VA e [0, 27r]. (23) 

i=i 


where ^(.,.) is the beta function. As explored in Il20l . this 
is a multiplicity correcting prior ll34l over graphs with the 
desirable property of diminishing false positive edge dis¬ 
coveries as extra unconnected nodes are added to the graph. 

4 METHODS FOR SINGLE TIME 
SERIES 

In some applications of interest one observes only a sin¬ 
gle multivariate time series, N = 1, from which the graph 
must be inferred. Two challenges arise in this setting: (1) 
the effective number of observations informing Eq. d is 
just one and (2) the periodogram used in computing 
is noisy regardless of the length of the series, T. The pe¬ 
riodogram is a notoriously poor estimator of the spectral 
density, and when the spectral density itself is of primary 
interest, a common frequentist method is to smooth the pe¬ 
riodogram to obtain a consistent spectral density estimator 
IHElIl. One could imagine using the smoothed pe¬ 
riodogram as a plug-in estimator in Eq. d, scaled by 
the effective degrees of freedom (see the Supplement for 
more details on this plug in estimator for our formula¬ 
tion). An alternative variance-reduction technique is the 
Bartlett method llTSll . that divides the length T series into 
M shorter series of length ^ and averages the resulting 
M periodograms, but at the cost of reduced resolution (i.e., 
number of considered frequencies). This approach mimics 
the implicit smoothing that occurs when we compute the 
periodogram based on N truly independent series each of 
length T, as in Eq. d- 

In contrast to a plug-in estimator, a natural Bayesian ap¬ 
proach enforces smoothing across frequencies via a prior 
distribution over the set of spectral densities ES\ . Previ¬ 
ous approaches have coupled elements of a Cholesky de¬ 
composition of each spectral density matrix across frequen¬ 
cies, however this approach is unsuitable to our case since 


Under this prior, the marginal likelihood for the single 
(N = 1) time series becomes 


p(X|G) 


M 

TT-^Pfl 

J = 1 


h{W„d„G) 

h{W*,S*,G) 


(24) 


where S* = Sj + Y.k=o and W* = Wj + 

J2k=o '^XkewjPk- By setting M = [v/TJ, we obtain an 
asymptotically approximate nonparametric prior distribu¬ 
tion over continuous spectral density matrices: for T large 
enough the prior puts positive support on spectral density 
matrices arbitrarily close to any continuous spectral density 
over [0, 27r]. Furthermore, under this setting as T —> oo, the 
number of Fourier frequencies, and thus number of samples 
Y^k^o within each interval grows as y/T. 


5 INFERENCE 


Bayesian structural learning algorithms for decomposable 
graphs come in two flavors; MCMC samplers and stochas¬ 
tic search procedures ll20l 1^ . By placing decomposable 
graphical inference for time series in the same framework 
as for the i.i.d. case via our analytic p^Ki-N \ G), we can 
easily modify both types of existing methods for the time 
series case. 

Classical MCMC samplers for decomposable graphs sam¬ 
ple from the posterior over graphs via Metropolis-Hastings 
(MH) by proposing single edge addition and deletion 
moves that keep the graph decomposable iflSl 1^ . While 
it is possible to obtain any decomposable graph from any 
other decomposable graph via a sequence of edge addi¬ 
tions and deletions, the path may be hard to reach lead¬ 
ing to prohibitive converge times for even a moderate num¬ 
ber of vertices p. More recent graph samplers add more 
global moves by either randomly generating new decom¬ 
posable graphs IMl or by generating from a Markov chain 










over a junction tree representation of the graph im. To 
compute the MH acceptance ratio, these samplers rely on 
computing ratios of present and proposed marginal likeli¬ 
hoods. For simple edge additions and deletions, this ratio 
simplifies into a function of only the cliques and separators 
that change between moves. For our case, the ratio expands 
into a product over frequencies of the same affected cliques 
and separators, allowing simple modifications to the exist¬ 
ing implementations of these samplers to handle TGMs. 


All current MCMC samplers struggle to scale to even mod¬ 
erate numbers of nodes. In contexts where point estimates 
suffice, we can instead consider stochastic search proce¬ 
dures. We utilize a modihcation of the efficient feature- 
inclusion stochastic search (FINCS) l20l for inference in 
our TGMs. FINCS interleaves three moves: 1) single edge 
addition and deletion moves for local changes to the graph, 
2) global sampling moves where edges are added inde¬ 
pendently to an empty graph and the final graph is trian¬ 
gulated to maintain decomposability, and 3) resampling 
at step t a full graph from a list of past visited models, 
{Gi, G 2 ,..., Gt_i}, in proportion to their posterior prob¬ 
abilities. In steps 1) and 2), to enforce exploration of high 
probability regions, edge additions that tend to continually 
improve the model probability are preferentially selected 
in proportion to a current heuristic estimate of the posterior 
edge probability 




J2k=l ^{i,j}&EtPiXi-.N\Gt)p{Gt) 
ELi PiXv.NlGt) 


(25) 


where Et is the current edge set. Edge deletions are per¬ 
formed proportional to As in MCMC samplers 

HSIEII, the junction tree representation of the graph can 
be efficiently updated after each local move since the two 
graphs only differ by a single clique and its corresponding 
separators, allowing a quick computation of the marginal 
likelihood of a proposed graph in Eq. Importantly, 

the EINCS algorithm depends on the data only through the 
marginal likelihoods of the cliques C —used to compute the 
full graph marginal likelihood—which in our TGM case is 
a product over T frequencies: 


n 


B{Wk,c,S) 

B{wic^6*y 


(26) 


That is, our implementation simply modihes the original 
EINCS definition of the clique marginal likelihood. 


6 SIMULATIONS 

To test our TGM methods, we consider simulated setups for 
both N > 1 and TV = 1 time series each generated from an 
order-1 vector autoregressive process, denoted VAR(l), for 
p = 20 dimensions. Specifically, we simulated data from 
the model 

(27) 


where x{t) € A € and e(t) ~ N{0, Ipxp). The 

inverse spectral density of a VAR(l) process is given by 171 

=I + A'^A + e-^^A + E^A'^. (28) 

Random sparse TGMs were generated by first restricting 
A to be upper triangular. Eollowing the simulated setup in 
ill , we set the diagonal elements to a constant An = .5 and 
sample the upper diagonal elements as Uij ^ ASij, where 
Sij ^ Binomial(p) with p — .2 for all simulations. The 
graph G was then determined by identifying the zeros in 
S{X)~^ using Eq. ( [TS) . Proposed A matrices were accepted 
when both the absolute value of all eigenvalues of A were 
less than one, making the series stationary, and the graph G 
determined by A was decomposable. 

We note that since our formulation reduces to a standard 
structure learning problem, our emphasis is less on assess¬ 
ing performance with respect to p, which should follow 
from whichever structure learning algorithm is selected; 
instead, our focus is on N and T, which are specihc to 
the time series and spectral analysis. Eor example, in the 
EINCS algorithm ll20ll . it is quoted that the method can han¬ 
dle graphs with up to roughly p = 100 nodes. 

6.1 Multiple time series 

To analyze how our TGM structure learning performance 
varies with the number of time series replicates, N, we sim¬ 
ulated data for N G {20,50,100,150,200,250,300,350} 
and T G (25, 50,100500,1000,1500, 2000}. This pro¬ 
cess was repeated 200 times for each combination of N 
and T. Each time series is hrst decomposed into its dis¬ 
crete Eourier components. We then ran 10,000 iterations 
of the EINCS algorithm using the fractional marginal like¬ 
lihood in Eq. ( [T^ with g = ^, a default setting ll29l l20l . 
Our graph prior followed the multiplicity correcting form 
in Eq. (|2T| with a = b = 1. The graph visited with highest 
posterior probability was then selected and true and false 
positive rates were computed. Results are displayed in 
Fig. 0 Across T, the true positive rate increases quickly 
with the number of series, N, achieving an almost per¬ 
fect true positive rate by about N = 150. We also see 
that the rate of increase in the true positive rate increases 
with the length of the series T, which relates to the num¬ 
ber of considered Eourier frequencies. It is interesting to 
note that for all T under consideration, the false positive 
rate tends to start very low (« .005) for = 20 replicates 
then spike at A S {50,100} before declining again. This 
occurs due to the fact that at low N, very few edges are 
introduced at all, perhaps due to an Occam’s razor type ef¬ 
fect of marginal likelihoods penalizing model complexity. 
As N starts to increase, more edges are introduced, both 
correct and incorrect, and as N further increases, the false 
edges are pruned and true edges are retained, leading to a 
decline in the false positive rate. Note that the false posi¬ 
tive spike tends to be more pronounced for time series of 


x{t) = Ax{t — 1) -f e(f), 
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Figure 1 : Top: As a function of the number of time series N, and plotted for various values of their length T, (left) mean 
true positive rate, (middle) median false positive rate, and (right) mean running time computed across the 200 replicates. 
Standard error bars are small relative to the scale of the plots and are omitted for clarity. Bottom: Same plots as a function 
of T for a single time series (N = 1), and plotted for various periodogram smoothing techniques. 


smaller length, T G {25, 50}. One would expect to see 
significant improvements, especially for small JV, by lever¬ 
aging the piecewise constant prior of Sec. and explored 
in Sec. |6.2| where we show that we are able to learn graphs 
from just = 1 time series. However, we chose not to 
include this prior in this analysis so as not to confound its 
effect with our performance. Here, the noisy periodogram 
is smoothed implicitly by averaging over Ai. 

Finally, in Fig. [T] we see that runtime increases as a func¬ 
tion of T due to the dependence on T in the marginal like¬ 
lihood computation of Eq. d, though significant cost re¬ 
ductions can be achieved through parallelizations leverag¬ 
ing the product form. 

6.2 Single time series: comparison of methods 

To assess the performance of our single-time-series meth¬ 
ods outlined in Sec. we simulated a time series with 
T G {500,1000,2500, 5000, 7500,10000}. For the piece- 
wise constant prior method, we use M = [v/TJ pieces. 
We compare against the Bartlett time-series-splitting ap¬ 
proach with the number of splits set to [VT\ ■ We also ex¬ 
amine a smoothed plug-in estimator of the spectral density 
using a Daniell smoother outlined in the Supplement with 
m = L^J for a total window size of 2 -f 1 « ■ 

For each method, the FINCS algorithm was run for 10,000 
iterations and the highest scoring graph was selected and 
used to compute true and false positive rates. This pro¬ 
cess was repeated 200 times with results displayed in Fig.[l] 


with a replicate representative of our median performance 
shown in Fig. The true positive rate increases for all 
three methods as a function of T, achieving a final value of 
about .9 for both the plug-in and piecewise constant prior 
methods and .79 for the Bartlett method at T = 10000. 
All methods maintain a low false positive rate around .02. 
Overall, the Bartlett method performs uniformly worse in 
terms of both true and false positive performance, while the 
piecewise prior method performs on par with the plug-in 
method, but at a fraction of the computational cost. Further 
experimental simulations are given in the Supplement. 

7 GLOBAL STOCK INDICES 

We explore the utility of our method in discovering con¬ 
ditional independencies between countries inherent in the 
global financial system. A similar experiment was con¬ 
ducted in Q using a penalized-likelihood approach to 
learn TGMs, but restricted to finite-order VAR mod¬ 
els with pre-specified order. (Recall that our method 
only assumes Gaussian stationarity, which includes the 
class of possibly infinite order VAR processes.) Using 
www.globalfinancialdata.com, we acquired the 
daily closing prices of 17 stock indices in US dollars for 
various countries around the world (see the Supplement for 
the full list) from June 3, 1997 to June 30, 1999. Missing 
prices were backfilled and only days where all exchanges 
traded were considered which resulted in time series of 
length 542. Following standard practice when analyzing 
stock prices, we converted the closing prices, pt, on day t 
to log-returns according to 
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Figure 2; Example evolution of error types for the piecewise prior method as a function of series length, T G 
{1000, 2500, 5000,10000} and = 1, for a selected graph. Blue, red, black, and white entries indicate true positives, 
false negatives, false positives, and true negatives, respectively. The graph was selected by choosing the graph out of 200 
replications with median true positive rate at T = 2500. 


n = I001og(pt/p(_i). 

We compare the graphical models inferred under two set¬ 
tings: (i) treating the log-returns as independent (as in ||20l) 
and (ii) using our methods to learn a TGM treating the log- 
returns as a time series. The best graphical models learned 
in each scenario are depicted in Fig. 

For our TGM algorithm, we computed the periodogram for 
the 17-dimensional time series, resulting in 542 complex¬ 
valued matrices of dimension 17 x 17. Since we only have 
one realization of the time series, we smoothed the pe¬ 
riodogram using the techniques and settings discussed in 
Sec. |6.2| We then ran the FINCS algorithm for 100,000 
iterations. We compare the resulting highest-probability 
graph (see Fig. to that learned treating the time series 
as independent based on the model in ll20l . again using 
100,000 iterations of the FINCS algorithm, but in its origi¬ 
nally proposed form for non-temporal data. 

In Figurej^ we see that in both cases we recover some geo¬ 
graphical relationships between countries. However, the in¬ 
dependent model returns a significantly denser graph than 
that learned by our TGM approach. Since the independent 
model is not taking the temporal nature of the data into 
account, some edges are likely spurious due to random cor¬ 
relations. The TGM, on the other hand, provides an inter¬ 
pretable and intuitive structure with strong geographic con¬ 
nections. For example, there is a distinct United Kingdom 
/ eurozone cluster of Germany ‘DE’, Finland ‘FT, Nether¬ 
lands ‘NF’, Belgium ‘BE’, Switzerland ‘CH’, Austria ‘AT’, 
Spain ‘ES’, Italy ‘IT’, Portugal ‘PT’, and the United King¬ 
dom ‘UK’. Another distinct cluster includes the United 
States ‘US’, Canada ‘CA’, Hong Kong ‘HK’ (whose cur¬ 
rency is linked to the USD), and Australia ‘AU’ (whose 
currency is correlated with the US S&P), with Japan ‘JP’ 
hanging off this cluster. One perhaps strange missing link 
is between Ireland ‘IE’ and the UK, though the US and Ire¬ 
land have a long history of economic connections possibly 
explaining why Ireland is included in the separator between 
these two distinct clusters. 

In the Supplement, we include (i) a comparison of our 



Figure 3: Graphical models with the highest posterior prob¬ 
ability for the stock index data. Left: Treating the log- 
returns as independent. Right: Using our TGM algorithm. 
In both cases, we see regional connections, but our TGM 
algorithm results in a sparser and more interpretable graph. 

learned graph with that of Songsiri et. al. Q, and (ii) fur¬ 
ther details on the stock data itself. 

8 MAGNETOENCEPHALOGRAPHY 
DATA 

Next we learn TGMs to capture the structure of underlying 
cortical dynamics from magnetoencephalography (MEG) 
data collected from ten subjects who were asked to per¬ 
form a task while maintaining focus on an audio stream 
and then again while switching focus Oil. Our goal is to 
discover differences in the underlying TGMs between the 
non-switching and switching attention conditions. Such 
differences provide further understanding into the neural 
underpinnings of auditory selective selection, an important 
constituent to communication. 

The data were collected for each subject performing the ex¬ 
periment in the switching (S) and non-switching (N) atten¬ 
tion conditions. For both S and N conditions, each subject 
performed the task under an auditory condition of high (U) 
and low (D) pitch, and spatial conditions of left (L) and 
right (R) attending. For each of the eight possible condi¬ 
tions, MEG recordings were collected resulting in a 150- 
dimensional time series of length 992 where each dimen- 








Figure 4; Learned TGMs for different MEG conditions. Each node on the periphery represents a brain region with loca¬ 
tion indicating anatomical location. Top: Intersection of learned edges between switching and non-switching conditions. 
Bottom: Black edges indicating those in the non-switching condition but not in the switching and red vice versa. 


sion corresponds to a localized region of the brain. We 
have between 17 and 30 trials for each subject, resulting in 
about 200 replicate time series per condition. 

Often with MEG data, many of the dimensions are domi¬ 
nated by noise due to limited brain activity in that region. 
We reduced the number of brain regions we studied from 
150 to 50 by only considering those with largest variance. 
In particular, for each trial we mean-centered all of the 
time-series and computed the variance and retained the top 
50 most volatile regions. 

We computed the periodogram for each trial and averaged 
across trials within each condition, resulting in eight peri- 
odograms. We ran our spectral TGS version of the FINCS 
algorithm on these periodograms for 100,000 iterations 
with fractional prior parameter 4/A^c, where Nc is the num¬ 
ber of trials for condition c € {S, N} x {U, D, L, R}. We 
also ran the algorithm for 1.7 million iterations and saw no 
difference in the resulting graphs. 

In Figure]^ we depict the intersections and differences be¬ 
tween the learned graphs for each experimental condition. 
We see in the top row that there are a lot of shared connec¬ 
tions between the switching and non-switching conditions 
for each auditory condition. In the bottom row, the differ¬ 
ences between the switching and non-switching conditions 
are depicted where red edges are those in the switching 
condition but not the non-switching, and black edges are 
the reverse. The difference plots show that there seems to 
be substantial “rewiring” for many of the conditions with 
many edges connecting frontal to back regions. Interest¬ 
ingly, we again see consistencies in these rewirings across 
conditions. Additionally, we reliably uncover local connec¬ 
tions between adjacent brain regions across experimental 
conditions. Such observations provide guidance for devel¬ 
oping experiments and methods to discern the underlying 
mechanisms that give rise to these different structures. 


9 DISCUSSION 

We introduced a Bayesian approach to graphical model 
structure learning for time series. In particular, we propose 
a prior—the hyper complex inverse Wishart distribution— 
for the spectral density matrices in a Whittle likelihood ap¬ 
proximation. For decomposable graphs, this prior is conju¬ 
gate and leads to a closed-form expression of the marginal 
likelihood of the time series given the graph, marginalizing 
the spectral density matrices across frequencies. Being able 
to integrate out this large collection of complex matrices— 
one for each time point—is critical to developing a prac¬ 
tical and scalable inference algorithm. For this, exploiting 
the fact that our marginal likelihood is analogous to that for 
i.i.d. Gaussian graphical models CD but with a product 
over the number of Fourier frequencies, allows us to de¬ 
ploy straightforward modifications to existing MCMC and 
stochastic search algorithms. Our simulations show that 
when many time series are observed, our method recovers 
the correct graph. When a single time series is observed, 
we proposed a method to increase robustness of our graph 
estimation using a piecewise constant prior. Our results on 
the stock and MEG datasets demonstrated our ability to dis¬ 
cover intuitive and interpretable structure in these datasets, 
importantly leveraging the temporal dependencies. 

Extensions to non-decomposable graphs are possible us¬ 
ing the i.i.d. graph approaches in both ED and Ea. 
A Laplace approximation to the marginal likelihood for 
non-decomposable graphs is proposed in ll22l . which 
we could similarly utilize to approximate the frequency- 
specific marginal at each term in Equation ( [l4| l. Paral¬ 
lelizing the Laplace approximation computation across fre¬ 
quencies would lead to a scalable method for inference in 
non-decomposable time series graphs. 
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Bayesian Structure Learning for Stationary Time Series 
Supplementary Material 


Abstract 

In this supplement we provide more background 
on spectral analysis of time series and the com¬ 
plex normal and complex inverse Wishart distri¬ 
butions. The hyper complex inverse Wishart is 
then introduced in more detail and its marginal 
likelihood is derived. We also provide more 
detail about the periodogram smoothing plug 
in method referenced in Section 4 of the main 
text. Finally, we provide details about the global 
stock index data set and compare the underly¬ 
ing conditional independence graphs learned by 
our method with that of Songsiri et al. Q. We 
also provide supplemental simulations showcas¬ 
ing the predictive performance of our approach 
in the frequency domain. 

1 Spectral Analysis Background 

Spectral analysis is an approach to analyzing stationary 
time series where the main object of interest is the spec¬ 
tral density. For many applications, the spectral density is 
often a more informative object about the underlying phys¬ 
ical process than autoregressive coefficients or the lagged 
autocovariance. In particular, it allows one to read off 
which frequency components are prominent in a time se¬ 
ries, and usually certain frequency bands have scientific 
relevance. The theoretical justification for caring about the 
spectral density arises from the classic spectral representa¬ 
tion theorem ll24ll . Informally, this theorem states that un¬ 
der some conditions, any stationary vector stochastic pro¬ 
cess x{t) G can be written as 

x(t)= f e-**^dZ(A) (1) 

J —TT 

where dZ{X) G is an orthogonal increments process 
such that E{Z{X)Z{X)*) = S{X), where S{X) is the spec¬ 
tral density matrix and E{Z{X)Z{X')*) = 0 for A ^ A'. 
Intuitively, this theorem states that the amplitudes of certain 
frequencies in a stationary process are independent across 
frequencies and the within frequency covariance is given by 
the spectral density matrix. As noted in the main text, the 
spectral density matrix also arises as the Fourier transfor¬ 
mation of the lagged auto covariance matrices of the pro¬ 
cess 

OO 

^(A)= ^ r{h)e-^^^ (2) 

h— — oo 


where r(/i) = E{x(t)x{t -f The statistical task then 
becomes estimation of the spectral density matrix S'(A) 
from a finite observed time series x = [x(l),..., x{T)]. The 
most basic estimator is given by the periodogram, defined 
as: 

Pk = dkdl (3) 

where dk is the DFT of the observed time series at Fourier 
frequency Afc, (Eq. (7) of the main text). While this esti¬ 
mator is asymptotically unbiased, it is not consistent since 
its variance does not go to zero. Instead, techniques that 
smooth across nearby frequencies provide consistent esti¬ 
mators of the spectral density, and are commonly used in 
practice ll3^ . Finally, the periodogram estimates at dif¬ 
ferent frequencies, Pk and are asymptotically uncorre¬ 
lated, providing some intuition as to why the Whittle ap¬ 
proximation decomposes into independent terms for each 
frequency J^. More details on the spectral approach to 
analyzing stationary time series is provided in Brillinger, 

2001 ED. 

2 Smoothing the Periodogram for a Single 
Time Series via the Plug in Method 

In this section we provide more details on the plug in 
method for smoothing the periodogram obtained from a 
single realization of a multivariate time series mentioned in 
Sec. 4 of the main text. First we provide some background 
on classical frequentist approaches to smoothing the pe¬ 
riodogram to obtain consistent estimators of the spectral 
density (as T increases). 

When the spectral density itself is the primary object of 
interest, a common frequentist method is to smooth the pe¬ 
riodogram to obtain a consistent estimator of the spectral 
density: 

SiXk) = ^T{j)Pk+j (4) 

|j|<m 

where Pk is the periodogram at frequency Afc as introduced 
in the main text and Wrij) > 0, X]|j|<m ^tU) = 1 are 
some smoothing weights for a length T series and m is the 
smoothing window. This approach was used in the frequen¬ 
tist graph estimation frameworks in lfT4l [131 [8l . To ensure 
consistency as T —>■ oo we must have m -G oo, ^ —t 0, 
J2\j\<m^T{j)'^ —>■ 0 li24l . The asymptotic variance 

of Sk scales as X)|j|<m ^tU)’ implying that the asymp¬ 
totic effective sample size for a smoothed estimate of this 
form is iJ2\j\<m^TU))~^ li24l . The Daniell smoother 
corresponds to taking WtU) = 2 rn+i ™ intuitive 



(effective) sample size of 2m + 1, the size of the smooth¬ 
ing window. Intuitively, this holds asymptotically since as 
T —> oo the sample periodograms become independent at 
different frequencies implying a sample size of 2m + 1, the 
number of (asymptotically) independent samples. 

Inspired by the use of this smoothing technique in previous 
TGM procedures O [T3l H] we develop a similar proce¬ 
dure tailored to our objective function in Eq. ( [T4l l. We 
plug in a smoothed estimate of the spectral density ma¬ 
trix, scaled by the asymptotic effective degrees of freedom, 
for the priodogram, Pk, in Eq. Specihcally, we set 

= Wk + iJ2\j\<m The degrees of free¬ 

dom parameter is similarly updated by adding the ef¬ 
fective sample size of the smoother to the prior degrees 
of freedom: 5* = 4 -p (I]|j|<m we use 

the Daniell smoother outlined above the updates become 
W: = Wk + Eb|<„^ Pk+j and 4 = 4 + 2m + 1. In 
practice we set m = to ensure that the conditions for 

consistency of Sk are met. 


with normalization constant 
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Note that we have used an alternative parameterization 
of the inverse Wishart distribution commonly used in the 
graphical modeling literature US). The marginal distri¬ 
bution of Sa where A C {1,... ,p} is given by ^ 

IW,{6,Wa). 


4 Marginal Likelihood for the Hyper 
Complex Inverse Wishart 

We dehne the hyper-complex inverse Wishart distribution 
for a graph G = {V, E} in the main paper as the restriction 
of the complex inverse Wishart distribution to S S 
with a zero pattern in specified by G. Its density is 
given by: 


3 The Complex Normal and Complex 
Inverse Wishart Distributions 


pimW,G) = ls6M+(G)W<5,G)|E|-(^+2p)e-'^'^^" 

( 8 ) 


The complex normal distribution is a generalization of the 
multivariate normal distribution to the complex domain. 
Let Z G be a complex random variable. Z is dis¬ 
tributed as a complex normal distribution, A/'cjO, E), with 
zero mean and complex Hermitian positive dehnite covari¬ 
ance matrix E S if it has density given by 

pW = ’’’ ® 

where z* = denotes the conjugate transpose of z. If 
Z ~ A^c(0, E) then the distribution over Z can be repre¬ 
sented equivalently as a joint distribution over the real and 
imaginary elements of Z = X + iY, X,Y € 


where h{W, 6, G) is a normalization constant and M'^iG) 
is the set of positive dehnite matrices with zeros in their 
inverse that obey the conditional independence properties 
of G. 

Due to the fact that the complex inverse Wishart distribu¬ 
tion is conjugate to the complex normal distribution for an 
unrestricted E, by Proposition 5.1 in M it follows that 
the hyper complex inverse Wishart distribution is a strong 
hyper-Markov distribution. It follows that for decompos¬ 
able G the complex hyper inverse Wishart density can be 
written in terms of the cliques, C, and separators, S, of G: 


p{Y\d,W,G) = 

1i:gm+(G) 


TIggC 

risGs 


p{Yc\Wc,S) 

p(Es|lCs,(5) 


(9) 
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ReE -ImE 
ImE ReE 


(6) 


where ReE and ImE indicate the real and imaginary com¬ 
ponents of E, respectively. Thus we see that the real and 
imaginary components are independent iff ImE = 0. As 
in the non-complex case, the marginal likelihood of Xa 
for some subset of nodes A C {!,... ,p}, is given by 
Xa ~ A/'c(0, E^), where E^ is the matrix formed by se¬ 
lecting the rows and columns of E in A. 

The conjugate prior distribution for E is given by the com¬ 
plex inverse Wishart, E ^ IWc{S, W), with degrees of 
freedom parameter i5 > 0 and centering matrix W G 
Hermitian positive dehnite. Its density is given by 


p(E|kP,4 = 


where p{Yjc\Wc,5) is the unrestricted complex inverse 
Wishart density for Ec. This decomposition implies that 
the normalization constant for Equation (|^ is also given 
by the ratio of complex inverse Wishart normalization con¬ 
stants for cliques and separators 


h[W, S, G) 




( 10 ) 


If Zi,...Za! * A4:(0, E), then the joint distribution of 

Zi,..., Zm, and E can be written as: 


li:eM+(G) 


p(zi,...,Ziv,E|G,cx 

h{W, 5, G) , 
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( 11 ) 

( 12 ) 


( 7 ) 














6 Global Stock Indices 


We note that the part dependent on S is the kernel for 
a HIWc{W + X]i=i distribution, it fol¬ 

lows that the marginal distribution of Zi,..., Zn\G,W,S 
is given by the ratio of prior and posterior normalization 
constants of the complex hyper inverse Wishart distribution 
times a likelihood constant: 


p{zi,. . .,Zn\G,W,d) 


_ h{W, 6, G) _ 

TT^Ph{W + <5 + Af, G) ■ 

(13) 


4.1 Marginal Whittle Likelihood 

The model in the main text places independent 
HIWc{Wk,Sk,G) priors on each spectral density 
matrix in the Whittle likelihood, Sk HlWdWk, Sk,G) 
Vk £ [T — 1]. Applying the above marginal likelihood 
result to each frequency component in the Whittle approx¬ 
imation shows that the marginal likelihood of the data 
given a graph, a set of centering matrices ,Wo,..., Wt-i, 
and degrees of freedom. So, ■■ ■ St-i, for each frequency 
can be approximated by a product of the normalization 
constants across frequencies: 


6.1 Learned Graph Comparison with Previous 
Methods 

For comparison with the method of Songsiri et al. jT!, 
we provide the CIG graphs learned on the international 
stock data set using both their autoregressive method and 
our nonparametric Bayesian method in Figure [T] Further 
details on the meaning of the edge weights can be found 
in El. Notice that both graphs capture similar structure, 
for instance the connections between the US, Canada, Aus¬ 
tralia, and Japan. Additionally, both graphs contain tight 
clusters containing European countries. 

6.2 Stock Data 

In Table we list the stock indices that were used 
in the main paper. The data was downloaded from 
globalf inancialdata. com for the dates June 3, 
1997 to June 30, 1999. 


F(Xi:Jv|G) 


T-1 

-NTp 

fc =0 


h(Wk,Sk,G) 

h(w:,s*„G)- 


(14) 


where = 114 + Pk and S^ = Sk + N. Indeed, this 
derivation shows that our prior specification for spectral 
density matrices is conjugate to the entire Whittle likeli¬ 
hood. 


5 Prediction Simulations 

To further validate our approach we analyze predictive per¬ 
formance in the frequency domain on simulated VAR(l) 
data as described in Section 6 of the main text. We set the 
dimension to p = 10 and time series length to T = 2500. 
We split the simulated series in half forming a training 
set and test set and then learn a graph and smoothed pe- 
riodogram on the test set of the time series. We use the 
Whittle marginal likelihood presented in Eq. to com¬ 
pare predictions between four graphs: the learned graph in 
the frequency domain, Gspectrai. the learned iid graph that 
treats the time series as independent observations, Gm , the 
full graph, Gfuii, with all edges included, and the empty 
graph, Gempty. with no edges. Eor prediction, the prior cen¬ 
tering matrix, 114, is given by the Daniell smoothed peri- 
odogram on the training data with its respective degrees of 
freedom, and is given by = Wk + where 
is the periodogram on the test data. Results are dis¬ 
played in Table 1. We see that the learned spectral graph 
does significantly better than the other graphs at prediction. 





G spectral 

Giid 

^empty 

Gfuii 

Loglik. 

-81623 ±205 

-87712 ±216 

-90040 ±211 

-103964 ± 304 


Table 1: Predictive marginal log-likelihood for simulated VAR data in the frequency domain on a held out test set using 
the marginal Whittle likelihood in Eq. ( |T4| ) under four different graphs (± indicates 1 standard error across simulation 
replicates). 




Figure 1: CIG graph learned on international stock data using (left) method presented in this paper and (right) the autore¬ 
gressive method presented in Songsiri et al. ||7l (Figure taken directly from fTI). 


Table 2: Stock index information. 

Index Name Ticker Country Country Code 


Amsterdam Exchange Index 

AEX 

Netherlands 

NE 

All Ordinary Composite 

AORD 

Australia 

AU 

Austrian Traded Index 

ATX 

Austria 

AT 

BEL 20 

BEX 

Belgium 

BE 

CAC40 

FCHI 

France 

FR 

FTSEMIB 

FTMIB 

Italy 

IT 

FTSE 100 

FTSE 

United Kingdom 

UK 

DAX30 

GDAX 

Germany 

GE 

Toronto Stock Exchange 300 

GSPTSE 

Canada 

CA 

Hang Seng Composite 

HSI 

Hong Kong 

HK 

IBEX 35 

IBEX 

Spain 

SP 

Irish Stock Exchange Index 

ISEQ 

Ireland 

IR 

Nikkei 225 

N225 

Japan 

JP 

OMX Helsinki 25 

OMXH25 

Finland 

FN 

Portugal Stock Index 

PSI20 

Portugal 

PO 

S&P500 

SPX 

United States 

US 

Swiss Market Index 

SSMI 

Switzerland 

CH 
















